add mimalloc #9

rob-p · 2024-05-29T01:30:35Z

I created this branch to play around with potential optimizations. However, right now this PR just contains one (very trivial) optimization. It replaces the global allocator with Mimalloc. This is a highly-optimized allocator that generally outperforms the system malloc. Given the propensity of Granges to perform many small allocations, such a change seems to help. I'd be very interested in your measurements!

jianshu93 · 2024-05-31T14:23:05Z

Hi Rob,

I saw 5% to 10% improvement at least using my random test dataset. I think it is worth it. Thanks for bring the mimalloc into attention. I have some problem with jemalloc but it seems mimalloc is perfectly compatible with system default malloc while being faster.

Thanks,

Jianshu

molpopgen · 2024-05-31T15:01:00Z

I would suggest that binaries including granges should define their global allocators in the file containing main. Unless things have changed, you can run into a mess when library crates define global allocators. For example, if a Vec<_> from another crate were passed to granges, they may not be compatible due to different global allocatorrs.

jianshu93 · 2024-05-31T15:20:01Z

Yes exactly. The best way is to add a feature like use_mimalloc. So that users can choose whehter they want to use it or not. I have tried in other crates with very frequent heap allocation. Let me know if this makes sense.

Thanks,

Jianshu

molpopgen · 2024-05-31T15:40:45Z

There is a matter of taste/style here. The other allocator may improve the performance of the current binary. If that appeared generally true (e..g, not just on macos or whatever), then maybe it should not be feature gated for the binary. The reason is that, imagine allocator A is best on some hardware, and B is best on another. Once you feature-gate in mimalloc ("A"), you cannot add a feature in for B, because you break the expectation that features are strictly additive.

rob-p · 2024-05-31T16:52:32Z

These are good points. I think the claim is that it makes sense to add the global allocator to the granges binary, and then suggest (e.g. in the README and documentation) that consumers of the library might consider adopting this allocator for best performance.

One thing that I wanted to mention to @jianshu93 is that, while mimalloc provides a nice bump in the current testing, the gains are often even larger in the multi-threaded context — it's improvement in allocation under thread contention is even greater.

@molpopgen: I don't think the improvements of mimalloc over the standard allocator are particularly hardware dependent. Their benchmarks (and my experience) suggest its generally faster at the same basic memory usage.

jianshu93 · 2024-05-31T17:50:35Z

Hello both,

I actually test it via several different machines in a parallel setting. Intel(R) Xeon(R) Gold 6226, AMD EPYC 7713, AMD EPYC 9534 . The performance bump is more pronounced on the most recent AMD EPYC 9534 for several crate, e.g, graph (skip list, well-known for current/parallel) as the underlying structure and also this crate. I never saw a 20% improvement or something, most around 8% to 15%. Should be interesting to test more different data structure because the mimalloc paper reported only for several give structures.

Thanks,

Jianshu

add mimalloc

b71c9a1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add mimalloc #9

add mimalloc #9

rob-p commented May 29, 2024

jianshu93 commented May 31, 2024

molpopgen commented May 31, 2024

jianshu93 commented May 31, 2024

molpopgen commented May 31, 2024

rob-p commented May 31, 2024

jianshu93 commented May 31, 2024

add mimalloc #9

Are you sure you want to change the base?

add mimalloc #9

Conversation

rob-p commented May 29, 2024

jianshu93 commented May 31, 2024

molpopgen commented May 31, 2024

jianshu93 commented May 31, 2024

molpopgen commented May 31, 2024

rob-p commented May 31, 2024

jianshu93 commented May 31, 2024